Skip to content

fix(qwp): fix missed terminal send errors during Sender close#42

Open
jerrinot wants to merge 8 commits into
mainfrom
jh_close_error_race
Open

fix(qwp): fix missed terminal send errors during Sender close#42
jerrinot wants to merge 8 commits into
mainfrom
jh_close_error_race

Conversation

@jerrinot

@jerrinot jerrinot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Fixes a race where QwpWebSocketSender.close() could return successfully even after a terminal QWP/WebSocket error was latched during close, and a related corner where close() could rethrow a wrapped terminal error the user had already caught from flush().

implementation details for reviewers

  • Track the exact latched instance that checkError() has synchronously surfaced and make close() suppress only that pre-owned instance, taken as the single read getSynchronouslySurfacedError(). This avoids treating a freshly latched terminal as already owned when it appears between two separate latch reads. Guarded by CloseOwnershipRaceTest (probabilistic accessor race, near-certain catch per CI run, ~0.15s).
  • Simplify the latch so its invariants live in one place: recordFatal wraps non-LineSenderException causes once at latch time, so every rethrow delivers the same instance — close()'s identity suppression becomes correct in every state, which also fixes a pre-existing corner where drainOnClose() re-threw a fresh wrapper for an error the user already caught. checkUnsurfacedError() encapsulates the close() safety net and hasUnsurfacedError() becomes private, so the racy two-read snapshot is no longer expressible outside the class. The typed SenderError payload is derived from the latched LineSenderServerException instead of being tracked in a sibling field.
  • Rename the latch field lastError to terminalError (and getLastError() to getTerminalError()): the field is a write-once, first-writer-wins latch — transient failures reconnect upstream and never reach it — so "last" was misleading.
  • CloseSafetyNetTest pins both branches of close()'s safety net deterministically: a config-string-only caller who never flushed must see the latched terminal thrown from close(); a caller whose custom error handler already received it must get a silent close.

Fixes a race where QwpWebSocketSender.close() could return successfully
even after a terminal QWP/WebSocket error was latched during close.

**implementation details for reviewers**
Track the exact `lastError` instance that `checkError()` has synchronously surfaced
and make `QwpWebSocketSender.close()` suppress only that pre-owned instance.
This avoids treating a freshly latched terminal error as already owned when it appears
between separate `hasUnsurfacedError()` and `getLastError()` reads.
@jerrinot jerrinot added the bug Something isn't working label Jun 9, 2026
jerrinot and others added 7 commits June 11, 2026 14:15
The latch leaked its invariants into callers: close() had to compose
hasUnsurfacedError() + checkError() for the safety-net rethrow, and
checkError() minted a fresh wrapper per call for raw causes, so
drainOnClose() could rethrow an error the user had already caught.

- recordFatal() wraps non-LineSenderException causes once, at latch
  time: every rethrow delivers the same instance, making close()'s
  identity suppression correct in every state.
- checkUnsurfacedError() encapsulates the close() safety net;
  hasUnsurfacedError() becomes private, so the torn two-read
  ownership snapshot is no longer writable outside the class.
- getLastTerminalServerError() derives the SenderError from the
  latched LineSenderServerException; the sibling field is gone.
No sender-level test asserted WHETHER close() throws: the existing
close assertions (InitialConnectAsyncTest) install a handler and
tolerate both outcomes, so deleting the safety net, inverting its
handler gate, or always-suppressing the snapshot all survived the
suite.

CloseSafetyNetTest adds the strict pair, each awaiting the latched
terminal deterministically before close():

- a config-string-only caller who never flushed must get the typed
  terminal thrown from close();
- a caller whose custom handler already received the error must get
  a silent close().

All three mutations above now fail the suite. The snapshot race
itself stays covered by CloseOwnershipRaceTest at the accessor.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
lastError suggested most-recent-wins, but the field is a write-once
first-writer-wins latch: transient failures reconnect upstream and
never reach it; only the error that ends the loop is latched. The
name now matches the documented invariant and the "latched terminal"
language used throughout. recordFatal's javadoc also states the
single-writer rule explicitly: the unsynchronized check-then-latch
is only safe because every caller runs on the I/O thread. The
retry-loop locals keep the lastError name — there it is accurate.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@mtopolnik

Copy link
Copy Markdown
Contributor

[PR Coverage check]

😍 pass : 24 / 28 (85.71%)

file detail

path covered line new line coverage
🔵 io/questdb/client/cutlass/qwp/client/sf/cursor/CursorWebSocketSendLoop.java 20 24 83.33%
🔵 io/questdb/client/cutlass/qwp/client/QwpWebSocketSender.java 4 4 100.00%

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants